The data I’ll be using here are a subset of a larger database of observations of marine fauna sampled over 12+ years in the Mediterranean Sea by a citizen science programm in France called Cybelle Mediterranee. The program is run by a non-profit organisation called Cybelle Planete, and they co-developed a free app called ObsEnMer so that anyone can use it to upload their observations and geolocate them. The whole dataset with many more variables can be dowloaded from their website, but you must create a profile which has to be validated before you have access to the data, and that may take a while.
I’ve created a subset of these data that I’ve processed (selected relevant columns, renamed them in english, etc), and I’ll happily send it to you. Simply contact me at camille.coux@orange.fr, and let me know what you intend to do with the data.
They have different sampling designs, and here I’ll be using the more advanced ones. In this case, observations were collected by volunteers. They sign up for week-long sampling sessions on a sailing boat and are supervised by a trained “ecoguide”, since most volunteers have no prior experience in marine animal identification, nor any particular skills in ecology, biology, oceanography or even sailing.
Observations are thus collected following random transects, i.e. a linear trajectory at a fixed speed, that lasts for at least 15 minutes, but can last several hours. Mainly because of weather and technical issues, these transects do no aim to be resampled more than once, but may very well overlap or cross trajectories over different sessions. While sampling occurs, the aim is to log in any sightings of marine fauna that ca be seen from the surface. Species are mainly cetaceans and birds, but also turtles, fish, macroplancton (mainly jellyfish), rays and a couple species of shark. Data are collected at regular time intervals, such that even when no animals are to be seen, a record is still logged into the dataset and hence corresponds to an absence (NA in the dataset). However, if an animal is detected, the observation is recorded even if the sighting happens in beteen 2 time intervals.
In addition to the observation data, I’ve also compiled measures of bathymetry, chlorophyll a concentration, and sea surface temperature from GEBCO for bathymetry, and oceancolor, for the latter 2.
To get the raw data, you may visit these websites (or orther hubs like Corpenicus for instance) and dowload them or file a request when necessary.
I also created a grid over the NW Mediterranean basin for the purpose of this analysis.
All files are necessary to run this analysis, and as for the other variables, I can send the processed versions I used for this analysis along with the observation data.
The aim is to conduct an occupancy analysis, and produce a number of maps to viualise the predictions. In this example, I’ll be processing the data and formatting it for the unmarked framework used to run the analysis.
Then we’ll run a occupancy model. It’s not super well tailored for our kind of data as we’ll see later on, so I’ll pass really quickly over the whole analysis part to go straight to the outputs.
library(magrittr)
library(tidyr)
library(dplyr)
library(unmarked)
library(ggplot2)
library(sf)
library(raster)
library(ncdf4)
library(mapview)
library(maps)
library(maptools)
# a few preferred options
options(scipen = 999)
theme_set(theme_minimal())
# import obs data
track <- read.csv2("../data/track.csv", row.names = 1)
# check
str(track)
## 'data.frame': 329949 obs. of 18 variables:
## $ index : int 182849 182850 182851 182852 182853 182854 182855 182856 182857 182858 ...
## $ year : int 2009 2009 2009 2009 2009 2009 2009 2009 2009 2009 ...
## $ month : int 7 7 7 7 7 7 7 7 7 7 ...
## $ day : int 26 26 26 26 26 26 26 26 26 26 ...
## $ yday : int 207 207 207 207 207 207 207 207 207 207 ...
## $ datetime : chr "26/07/2009 08:13:00" "26/07/2009 08:49:00" "26/07/2009 08:53:00" "26/07/2009 08:55:00" ...
## $ ref : chr "ce u2194" "ce u2194" "ce u2194" "ce u2194" ...
## $ long : num 6.38 6.4 6.4 6.4 6.4 ...
## $ lat : num 43 43 42.9 42.9 42.9 ...
## $ group : chr NA NA NA NA ...
## $ species : chr NA NA NA NA ...
## $ n : int NA NA NA NA NA NA NA NA NA NA ...
## $ n.abun : num NA NA NA NA NA NA NA NA NA NA ...
## $ protocole2 : chr "experte" "experte" "experte" "experte" ...
## $ bathymetry : num 103 103 1464 1464 1464 ...
## $ chla : num 0.159 0.159 0.159 0.159 0.159 ...
## $ sst : num 22.1 22.1 22.1 22.1 22.1 ...
## $ site.to.coast: num 10.3 10.3 16.6 16.6 16.6 ...
These are the columns we’ll be interested in:
# import grid:
m_grid <- st_read("../data/med_grid.shp", crs=4326)
## Reading layer `med_grid' from data source `/Users/camillecoux/Documents/Cybelle Planete/CybelleMed/cybelle/data/med_grid.shp' using driver `ESRI Shapefile'
## Simple feature collection with 9852 features and 1 field
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: -5.916665 ymin: 30.41668 xmax: 36.25084 ymax: 45.91698
## Geodetic CRS: WGS 84
# view grid in browser:
# mapview(m_grid$geometry,viewer.suppress = TRUE )
m_grid : 9852 cells across the Mediterranean Sea
We’ll be extrapolating the predictions of the occupancy model to the whole grid (even though only a few of these cells were sampled) later on. For now we just need the cell numbers to match with the obs data, so we intersect :
# intersect obs data with grid cells to get the grid_id column in the track dataset :
inter <- track %>%
st_as_sf(coords=c("long", "lat"), crs=4326) %>%
st_intersection(m_grid, track_sf) %>%
rename(grid_id = FID)
track <- left_join(track, inter%>%dplyr::select(grid_id, index))
# select observations for a given species, e.g. striped dolphin species,
# i.e. "dauphin bleu et blanc" in french:
presences <- track[grep("Dauphin Bleu", track$species),]
# selection of all absences
absences <- track[which(is.na(track$species)),]
# merge
striped_dolphin <- rbind(presences, absences)
# takeout presence only obseravtions
striped_dolphin <- striped_dolphin[-grep("ponctuelle", striped_dolphin$protocole),]
# keep observations from the summer months only since they concentrate most obseravtions
presences$month %>% table %>% barplot
striped_dolphin <- striped_dolphin %>%
filter(month %in% 6:9)
# check which years have enough data : keep data from 2015 to 2020
table(striped_dolphin$month, striped_dolphin$year)
##
## 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
## 6 0 0 0 0 0 0 210 6729 4971 8212 9748 688
## 7 562 1286 1127 436 2456 2605 389 14572 13444 28870 26831 9788
## 8 467 340 837 1025 3330 3792 361 17236 34273 21957 26036 13333
## 9 0 0 0 0 0 0 148 6354 13289 8946 7715 6159
striped_dolphin <- striped_dolphin %>% filter(year %in% c(2015:2020))
# select columns necessary for the analysis, and remove NAs
d <- striped_dolphin %>%
dplyr::select(bathymetry, site.to.coast, chla, sst) %>%
complete.cases
striped_dolphin <- striped_dolphin[d,]
(see ?unmarkedFrameOccu for more info on this).
For this, we need 3 things :
The site covariates are the ones that do not change from one year to the other. In this case, bathymetry and site.to.coast are site covariates, whereas chla and sst are obseration covariates, since repeated measures at different times are unlikely to yield the same values.
# 1. create An RxJ matrix of the detection, non-detection data, where:
# R = number of sites
# J = maximum number of sampling periods per site
RJ_mat <- striped_dolphin %>%
group_by(grid_id, year) %>%
summarise(n=sum(n, na.rm = T)) %>%
pivot_wider(names_from = year, values_from=n, names_prefix = "Y")
RJ_mat <- RJ_mat[,c(1, 7, 3, 2, 6, 4, 5)] # make sure years are in the right order
Before we prepare the covariate dataframes, we need to compute the mean values of each observation of a given year that fall in the same grid cell fromthe m_grid. This is because we’re using years as secondary occasions, even though there is more detail in the $chla and $sst values of the animal sightings “track” dataframe.
# Select observation variables : chla and SST
# calculate the mean chla and sst values for each grid cell:
chla <- striped_dolphin %>%
group_by(grid_id, year) %>%
summarise(chla = mean(chla, na.rm = T)) %>%
pivot_wider(names_from = year, values_from=chla, names_prefix = "chla") %>%
as.data.frame
chla <- chla[,c(1, 7, 3, 2, 6, 4, 5)] # make sure years are in the right order
sst <- striped_dolphin %>%
group_by(grid_id, year) %>%
summarise(sst = mean(sst, na.rm = T)) %>%
pivot_wider(names_from = year, values_from=sst, names_prefix = "sst") %>%
as.data.frame
sst <- sst[,c(1, 7, 3, 2, 6, 4, 5)] # make sure years are in the right order
# merge chla and sst into single observation matrix:
obscovs <- list(chla[,-1] , sst[,-1])
names(obscovs) <- c("chla", "sst")
# Select site covariates : bathymetry and distance from site to coast
# calculate the mean bathymetry and site.to.coast values for each grid cell:
sitecovs <- striped_dolphin %>%
group_by(grid_id) %>%
summarise(bathymetry.sites = round(mean(bathymetry, na.rm=T), digits = 1),
site.to.coast = round(mean(site.to.coast, na.rm=T), digits = 1))
# make the unmarked data frame
umf <- unmarkedFrameOccu(y = RJ_mat[,-1] %>% as.data.frame,
siteCovs = sitecovs[,-1] %>% as.data.frame,
obsCovs = obscovs)
# scale covariates and store values for later
sc <- scale(siteCovs(umf))
siteCovs(umf) <- sc
scobs <- scale(obsCovs(umf))
obsCovs(umf) <- scobs
head(umf) # look at data
## Data frame representation of unmarkedFrame object.
## y.1 y.2 y.3 y.4 y.5 y.6 bathymetry.sites site.to.coast chla.1 chla.2
## 1 NA NA 0 NA NA NA -1.30743639 -0.9951476 NA NA
## 2 NA NA 2 NA NA NA -0.21262905 -0.7443100 NA NA
## 3 NA NA 0 NA NA NA 0.07277998 -0.7861163 NA NA
## 4 NA NA 0 NA NA NA -0.30145583 -0.9694207 NA NA
## 5 NA NA 0 NA NA NA 0.06292129 -0.9372620 NA NA
## 6 NA NA 1 NA NA NA 0.71517211 -0.8150591 NA NA
## 7 NA NA 0 NA NA NA 0.77077511 -0.4548820 NA NA
## 8 NA 0 NA NA NA NA -1.30457737 -0.7796846 NA -0.2592282
## 9 NA 0 NA NA NA NA -1.31611203 -0.7346624 NA -0.2622702
## 10 NA 0 NA NA NA NA -1.33839267 -0.9662048 NA -0.2748914
## chla.3 chla.4 chla.5 chla.6 sst.1 sst.2 sst.3 sst.4 sst.5 sst.6
## 1 -0.3461244 NA NA NA NA NA 0.9360963 NA NA NA
## 2 -0.3306177 NA NA NA NA NA 1.0634697 NA NA NA
## 3 -0.3660105 NA NA NA NA NA 1.0366129 NA NA NA
## 4 -0.3275184 NA NA NA NA NA 1.0381730 NA NA NA
## 5 -0.3566758 NA NA NA NA NA 1.0195624 NA NA NA
## 6 -0.3422768 NA NA NA NA NA 0.9934395 NA NA NA
## 7 -0.3540590 NA NA NA NA NA 0.9582191 NA NA NA
## 8 NA NA NA NA NA 0.5849335 NA NA NA NA
## 9 NA NA NA NA NA 0.6383203 NA NA NA NA
## 10 NA NA NA NA NA 0.6376535 NA NA NA NA
summary(umf) # summarize
## unmarkedFrame Object
##
## 483 sites
## Maximum number of observations per site: 6
## Mean number of observations per site: 1.91
## Sites with at least one detection: 157
##
## Tabulation of y observations:
## 0 1 2 3 4 5 6 7 8 9 10 13 14 15 17 18
## 602 141 64 53 13 12 6 10 8 6 2 2 1 1 1 1
## 20 <NA>
## 1 1974
##
## Site-level covariates:
## bathymetry.sites site.to.coast
## Min. :-1.40671 Min. :-1.2203
## 1st Qu.:-0.94700 1st Qu.:-0.8826
## Median :-0.09669 Median :-0.3070
## Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 1.03193 3rd Qu.: 0.7286
## Max. : 2.13591 Max. : 3.3913
##
## Observation-level covariates:
## chla sst
## Min. :-0.6572 Min. :-4.3117
## 1st Qu.:-0.3101 1st Qu.:-0.7603
## Median :-0.1922 Median : 0.0902
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.0052 3rd Qu.: 0.7153
## Max. :13.0916 Max. : 2.3909
## NA's :1974 NA's :1974
# run model
occu.model = occu(~chla + sst ~ bathymetry.sites+site.to.coast, umf)
To the rest of the m_grid cells. To do this, we need values of the covariates at each of the cells. As an example, I chose to use the mean values measured for May 2020, and prepared their extraction in the /code/extract_env_data.R file.
# read in Med grid with environmental variables extracted for May 2020:
source("extract_env_data.R")
## Reading layer `med_grid' from data source `/Users/camillecoux/Documents/Cybelle Planete/CybelleMed/cybelle/data/med_grid.shp' using driver `ESRI Shapefile'
## Simple feature collection with 9852 features and 1 field
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: -5.916665 ymin: 30.41668 xmax: 36.25084 ymax: 45.91698
## Geodetic CRS: WGS 84
# there will be warnings, they're ok for this purpose
# check NAs
m_grid_may20 %>% apply(., 2, is.na) %>% colSums
## FID geometry bathymetry chla sst
## 0 0 0 44 1
## site.to.coast
## 0
# remove lines with NAs:
m_grid_may20 <- m_grid_may20[-which(is.na(m_grid_may20$chla)), ]
# values used to standardise the unmarked dataframe variables: need to apply
# the same standardisation values to all cells of m_grid_may20
# sc
mean_Bathy <- attributes(sc)$`scaled:center`[1]
sd_Bathy <- attributes(sc)$`scaled:scale`[1]
bathy.s <- (m_grid_may20$bathymetry - mean_Bathy) / sd_Bathy
# occupancy model estimates for bathymetry
summary(occu.model)
##
## Call:
## occu(formula = ~chla + sst ~ bathymetry.sites + site.to.coast,
## data = umf)
##
## Occupancy (logit-scale):
## Estimate SE z P(>|z|)
## (Intercept) 2.304 1.022 2.25 0.024166
## bathymetry.sites 3.827 1.034 3.70 0.000215
## site.to.coast -0.821 0.508 -1.62 0.105694
##
## Detection (logit-scale):
## Estimate SE z P(>|z|)
## (Intercept) 0.0320 0.110 0.291 0.770844
## chla 1.3436 0.401 3.348 0.000815
## sst 0.0827 0.108 0.767 0.442939
##
## AIC: 1068.379
## Number of sites: 483
## optim convergence code: 0
## optim iterations: 54
## Bootstrap iterations: 0
(beta <- coef(occu.model, type="state"))
## psi(Int) psi(bathymetry.sites) psi(site.to.coast)
## 2.3035913 3.8267041 -0.8214672
logit.psi <- beta[1] + beta[2]*bathy.s
psi <- exp(logit.psi) / (1 + exp(logit.psi))
# And now same things with chla :
# scobs
mean_chla =attributes(scobs)$`scaled:center`[1]
sd_chla= attributes(scobs)$`scaled:scale`[1]
chla.s <- (m_grid_may20$chla - mean_chla) / sd_chla
# occupancy estimates from model p(chla)
(beta.det <- coef(occu.model, type="det"))
## p(Int) p(chla) p(sst)
## 0.03199510 1.34358190 0.08269597
logit.p <- beta.det[1] + beta.det[2]*chla.s
p <- exp(logit.p) / (1 + exp(logit.p))
# for later:
labs <- c("0-10%", "10-20%", "20-30%", "30-40%", "40-50%", "50-60%", "60-70%", "70-80%", "80-90%", "90-100%")
m_grid_may20$bathy_prediction <- psi
# get quantiles of occupancy estimates
grid_occ <- quantile(psi,probs=seq(0, 1, 0.1), na.rm=T)
m_grid_may20$psi_bathy_quantiles <- cut(psi,breaks= grid_occ,labels=labs)
# 1. the actual bathymetry measures
mapview(m_grid_may20, viewer.suppress=T, zcol="bathymetry")
# 2. occupancy predictions based on bathymetry variable
mapview(m_grid_may20, viewer.suppress=T, zcol="bathy_prediction")
# 3. occupancy predictions based on bathymetry variable - discretised using quantiles
labs <- c("0-10%", "10-20%", "20-30%", "30-40%", "40-50%", "50-60%", "60-70%", "70-80%", "80-90%", "90-100%")
mapview(m_grid_may20, viewer.suppress=T, zcol="psi_bathy_quantiles")
The distance to the coast wasn’t significant.
What about the chla effect ?
m_grid_may20$chla_prediction <- p
# get quantiles of occupancy estimates
grid_occ_p <- quantile(p,probs=seq(0, 1, 0.1), na.rm=T)
round(grid_occ_p,2)
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
## 0.26 0.30 0.31 0.32 0.34 0.35 0.37 0.40 0.43 0.50 1.00
grid_chla_quantiles <- quantile(m_grid_may20$chla, probs=seq(0, 1, 0.1), na.rm=T)
m_grid_may20$chla_quantiles <- cut(m_grid_may20$chla,
breaks=grid_chla_quantiles,labels=labs)
m_grid_may20$p_chla_quantiles <- cut(p,breaks= grid_occ_p,labels=labs)
# maps
# 1. the actual chla measures
mapview(m_grid_may20, viewer.suppress=T, zcol="chla")
# 2. actual chla measures but binned in quantiles
mapview(m_grid_may20, viewer.suppress=T, zcol="chla_quantiles")
# 3. occupancy predictions based on chla variable
fig <- mapview(m_grid_may20, viewer.suppress=T, zcol="chla_prediction")
# To make these into .png files, this used to work, but generates an error now.
# Maybe because of my recent R update ?
mapshot(m, file ="striped_dolphin_prediction_chla.png",
map.types = "CartoDB.Positron")
# 4. occupancy predictions based on chla variable - discretised using quantiles
mapview(m_grid_may20, viewer.suppress=T, zcol="p_chla_quantiles")